-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add orient=tight format for dictionaries #35292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hmm +/- 0 on adding anything to to_dict here; we already have a lot of formats that kind of do the same things but not really and it's not always spelled out. While this could have some utility I think it adds to that confusion
If the ultimate goal is to include in the table schema most of that is written in Python, so maybe not as much of a concern? |
@WillAyd Here are two important comments in the original issue that this PR addresses: #4889 (comment) states (and I agree with this observation) that the #4889 (comment) states the issue with not preserving the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments, looks pretty good if you'd update the notes to 1.2
else: | ||
realdata = data["data"] | ||
|
||
def create_index(indexlist, namelist): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ensure_index does this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dr-Irv can you review this comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback ensure_index
doesn't handle the names of the indexes (or columns), so you'd still have logic to handle that part anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work
e..g just use Index
In [98]: pd.Index([(1, 0), (1, 1)], names=['foo', 'bar'])
Out[98]:
MultiIndex([(1, 0),
(1, 1)],
names=['foo', 'bar'])
In [99]: pd.Index([1, 2], names=['foo'])
Out[99]: Int64Index([1, 2], dtype='int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because note that the argument to set the names of a MultiIndex
is names
, while for an Index
it is name
. In your example in cell 99, you lost the name of the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grr this is a bug, actually is this still true on master (this was a pretty old version i am using)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still an issue. Using pd.Index
, with names
, you get a FutureWarning
. With name
, you get a stack trace.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.4.0.dev0+905.gdace93d694'
In [3]: pd.Index([(1,0), (1,1)], names=["foo", "bar"])
<ipython-input-3-99fe0b033485>:1: FutureWarning: Passing keywords other than 'data', 'dtype', 'copy', 'name', 'tupleize_cols' is deprecated and will raise TypeError in a future version. Use the specific Index subclass directly instead.
pd.Index([(1,0), (1,1)], names=["foo", "bar"])
Out[3]:
MultiIndex([(1, 0),
(1, 1)],
names=['foo', 'bar'])
In [4]: pd.Index([(1,0), (1,1)], name=["foo", "bar"])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-138c3fe61553> in <module>
----> 1 pd.Index([(1,0), (1,1)], name=["foo", "bar"])
c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
403 from pandas.core.indexes.range import RangeIndex
404
--> 405 name = maybe_extract_name(name, data, cls)
406
407 if dtype is not None:
c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in maybe_extract_name(name, obj, cls)
6876 # GH#29069
6877 if not is_hashable(name):
-> 6878 raise TypeError(f"{cls.__name__}.name must be a hashable type")
6879
6880 return name
TypeError: Index.name must be a hashable type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm we removed names=
. i still find this a bit of a footgun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i think we need to handle this a bit bitter if you wouldn't mind opening an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Seems like we ought to have another issue with respect to pd.Index([(1,0), (1,1)], name=["foo", "bar"])
not returning a MultiIndex
??
@jreback before I make the change you suggested, when we had the dev sprint on 9/4/20, we discussed this PR, and it was suggested that I add an argument such as So what is your opinion about introducing |
@Dr-Irv what's the status here? |
I've been waiting for @jreback to respond to this comment: #35292 (comment) I would like to introduce |
looks ok, can you merge master and will look |
@jreback have merged with master and all checks passed. Changed it from draft status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good a comment
else: | ||
realdata = data["data"] | ||
|
||
def create_index(indexlist, namelist): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dr-Irv can you review this comment here
thanks @Dr-Irv |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The issue in #4889 contains examples of how the JSON format doesn't support
MultiIndex
for the index or the columns, independent of theorient
chosen. Also, none of the orientations support index names (orMultiIndex
names). Finally, if you want to have a JSON format that is "tight" (or "compact"), thesplit
is the closest thing, but it is incomplete due to the indexing issues.As a first step to addressing this, I have created a
orient='tight'
option toDataFrame.to_dict()
andDataFrame.from_dict()
. If we agree that this is a reasonable representation, the next step (which I'd prefer to do in a second PR) would be to also support it with JSON. The challenge there is thatto_json()
is currently implemented in C. Right now, one can just usejson.dumps(df.to_dict(orient='tight'))
to get the needed JSON as a workaround.I'm open to changing the word
'tight'
to something else, but I wanted it to start with a different letter than the other orientations.